Persistent registry; lifecycle management by kaste · Pull Request #351 · packagecontrol/thecrawl

kaste · 2026-03-31T10:29:53Z

Let's try this one.

Created an orphaned branch the-registry, changed crawl.yml to read the registry from there, and to push it back to that branch if it changed.

generate_registry learned lifecycle management. New packages are tagged with "first_seen"; removed packages are re-added as tombstones.

A new tool generate_seed.py extracts "first_seen"/"removed" data from workspace.json files or registries.

Old data has been scraped from packagecontrol.io, so we have more tombstones in the database than before. It's unlikely we have all of them; I don't know what the actual policy was for pc.io.

Treat removed workspace entries without a stored source as coming from MAIN_REPOSITORY_SOURCE when enforcing ensure_secure_source(). This closes a takeover gap for imported tombstones that lacked source data. Keep the denial message honest by showing the persisted workspace value in diagnostics. When source is missing, report "<not-set>" instead of a synthesized trusted source. Add a deny-rules test that covers removed entries without source and asserts both denial behavior and the new message wording.
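A rough sketch of the rule above, not the actual implementation. Only ensure_secure_source, MAIN_REPOSITORY_SOURCE and the "<not-set>" wording come from this PR; the signature, the SourceDenied exception, the constant's value, the message text and the "any change of source is denied" comparison are assumptions for illustration.

```python
# Placeholder value; the real constant lives in the crawler.
MAIN_REPOSITORY_SOURCE = "https://example.invalid/main-repository.json"


class SourceDenied(Exception):
    """Hypothetical error type for a rejected source move."""


def ensure_secure_source(name, workspace_entry, new_source):
    stored = workspace_entry.get("source")
    trusted = stored
    # Removed entries without a stored source are treated as if they
    # came from the main repository, so they cannot be taken over.
    if stored is None and "removed" in workspace_entry:
        trusted = MAIN_REPOSITORY_SOURCE
    # Simplified deny rule (assumption): any move away from the trusted source.
    if trusted is not None and new_source != trusted:
        # Report the persisted value, never the synthesized one.
        shown = stored if stored is not None else "<not-set>"
        raise SourceDenied(f"{name}: source move from {shown} to {new_source} denied")
```

The enforcement uses the synthesized source, while the diagnostic only ever shows what is actually stored in the workspace.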

When crawl_package() raises, keep existing workspace state but ensure source is present by defaulting from the registry package contract. This keeps security behavior stable for denied source moves while also repairing entries that never had a successful crawl and thus missed source entirely. Add a regression test that verifies failed crawls adopt source from the registry entry when the existing workspace entry has no source.
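As a sketch of the failure path, assuming workspace entries and registry packages are plain dicts with a "source" key (the helper name is hypothetical):

```python
def entry_after_failed_crawl(existing, registry_package):
    """Return the workspace entry to keep when the crawl raised."""
    entry = dict(existing)  # keep the existing workspace state untouched
    # Only fill the gap; a stored source is never replaced here, so
    # denied source moves stay denied on the next run.
    if entry.get("source") is None:
        entry["source"] = registry_package["source"]
    return entry
```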

Exclude registry entries with a removed field from the scheduler in both normal and presto modes. Also block explicit --name crawls for tombstoned packages with a clear message so manual runs follow the same tombstone rule. Add focused scheduler tests that verify removed entries are skipped and that the next-run hint ignores tombstoned packages.

Keep --name handling simple and explicit: if the selected registry package is tombstoned, print a clear message and return without crawl. Add a focused regression test for main_() that verifies tombstoned packages are rejected in name mode and workspace remains unchanged.
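The tombstone rule for the scheduler and for --name can be pictured like this. The function names and the message text are made up for the example; the only thing taken from the PR is that an entry with a removed field counts as tombstoned.

```python
import sys


def crawlable_packages(registry_packages):
    """Drop tombstones (entries with a `removed` field) before scheduling."""
    return [p for p in registry_packages if "removed" not in p]


def select_by_name(registry_packages, name):
    """Return the package for --name, or None if unknown or tombstoned."""
    for package in registry_packages:
        if package["name"] == name:
            if "removed" in package:
                print(f"{name} is tombstoned; not crawling.", file=sys.stderr)
                return None
            return package
    return None
```

Because the next-run hint would be computed from the filtered list, tombstones cannot influence it either.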

Teach explain_main() to treat tombstoned registry entries explicitly. For tombstones, print a clear status line to stderr. In normal mode, print the raw entry as pretty JSON. In EFFECTIVE mode, print only the status line and no JSON payload since there is no effective release view. Keep this path simple by inlining the tombstone JSON print instead of adding a helper wrapper.
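Roughly, the tombstone branch behaves like the following sketch. The helper name, its parameters and the status wording are hypothetical; the real code is inlined in explain_main().

```python
import json
import sys


def explain_tombstone(entry, effective, out=None, err=None):
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    # The status line always goes to stderr.
    print(f"{entry['name']}: tombstoned, removed {entry['removed']}", file=err)
    if not effective:
        # Normal mode: raw registry entry as pretty JSON.
        print(json.dumps(entry, indent=2), file=out)
    # Effective mode: nothing on stdout, there is no effective release view.
```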

Move the effective explain logic and its helper functions out of crawl.py and into _explain_package.py so the explain-specific code lives together in one place. As part of that extraction, move the shared sublime_text selector parsing helpers into _utils.py so both crawl runtime logic and explain logic use the same implementation. Update the explain tests to import the helper-facing functions from _explain_package.

Teach maintenance() to copy tombstoned registry entries into workspace.packages before the legacy orphan-marking step. This keeps removed packages present in workspace and intentionally overwrites stale crawl-only fields with the canonical tombstone data. Add focused maintenance tests for tombstone import, overwrite behavior, and continued orphan removed marking.
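A minimal sketch of the import step, assuming workspace.packages is a dict keyed by package name (the helper name is hypothetical):

```python
def import_tombstones(registry_packages, workspace_packages):
    """Copy tombstoned registry entries into the workspace mapping."""
    for package in registry_packages:
        if "removed" in package:
            # Replace, do not merge: stale crawl-only fields must not
            # survive next to the canonical tombstone data.
            workspace_packages[package["name"]] = dict(package)
    return workspace_packages
```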

Add a regression test that imports a tombstoned package via maintenance(), then runs main_() with an active registry entry for the same name. Verify resurrection works without special-case code: the package is crawled, removed is cleared, source remains stable, and first_seen is preserved.

Add describe_registry_changes.py to generate commit-message text from old/new registry snapshots. Implement change classification for both packages and libraries, including single-change messages, metadata bulk edits, and mixed bulk edits with additions, tombstones, and resurrections. Keep repositories out of primary classification, but fall back to "Update registry.json" when repositories change without any entity change. This keeps "Same." strict so it only appears when no commit is needed. Add focused tests for all supported classifications and fallback cases, using loader mocking for CLI tests.
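To illustrate the classification order, here is a much reduced sketch. It only looks at packages, ignores libraries and metadata bulk edits, and assumes a registry is a dict with a "packages" list of entries carrying "name". Only "Same." and "Update registry.json" are message texts from this PR; the other wordings are invented for the example.

```python
def describe(old, new):
    old_pkgs = {p["name"]: p for p in old.get("packages", [])}
    new_pkgs = {p["name"]: p for p in new.get("packages", [])}
    common = old_pkgs.keys() & new_pkgs.keys()

    added = sorted(new_pkgs.keys() - old_pkgs.keys())
    tombstoned = sorted(
        n for n in common
        if "removed" in new_pkgs[n] and "removed" not in old_pkgs[n]
    )
    resurrected = sorted(
        n for n in common
        if "removed" in old_pkgs[n] and "removed" not in new_pkgs[n]
    )

    changes = [("Add", n) for n in added]
    changes += [("Tombstone", n) for n in tombstoned]
    changes += [("Resurrect", n) for n in resurrected]

    if len(changes) == 1:
        verb, name = changes[0]
        return f"{verb} {name}"
    if changes:
        return (f"{len(added)} added, {len(tombstoned)} tombstoned, "
                f"{len(resurrected)} resurrected")
    # Something else differs (e.g. repositories): a commit is still needed.
    if old != new:
        return "Update registry.json"
    return "Same."
```

The fallback sits before "Same." on purpose, so "Same." can only be returned when the two snapshots are identical.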

Implement implicit seed loading in generate_registry based on --output, with explicit overrides via --seed and opt-out via --no-seed. In seeded mode, preserve package first_seen, synthesize tombstones for missing packages, preserve tombstone removed timestamps, and keep resurrection first_seen. Libraries remain non-tombstoned. Keep fetching_source_failed behavior intact and add focused registry tests that cover seed/no-seed behavior, tombstones, resurrection, library handling, and deterministic package ordering.
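The seeded lifecycle rules can be sketched as follows. This is a simplified stand-in, not the code from generate_registry: the function name, the list-of-dicts inputs and the now parameter are assumptions, and the tombstone here keeps all seed keys instead of filtering them.

```python
def sketch_lifecycle(fresh, seed, now):
    """Merge freshly generated packages with seed lifecycle data."""
    seed_by_name = {p["name"]: p for p in seed}
    result = []
    for package in fresh:
        seeded = seed_by_name.pop(package["name"], {})
        # Active (or resurrected) package: keep the seeded first_seen,
        # never carry over `removed`.
        result.append({**package, "first_seen": seeded.get("first_seen", now)})
    for seeded in seed_by_name.values():
        # Known to the seed but missing now: keep an existing tombstone's
        # timestamp, or synthesize a new tombstone.
        result.append({
            **seeded,
            "first_seen": seeded.get("first_seen", now),
            "removed": seeded.get("removed", now),
        })
    # Deterministic ordering by package name.
    return sorted(result, key=lambda p: p["name"])
```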

Simplify seeded lifecycle handling after initial implementation. Use pick() for seed extraction, inline package sorting by name, and remove a no-op removed-field cleanup path. Also ensure first_seen is populated when missing, including for tombstoned entries, while still preserving seeded first_seen when available.

Clarify generate_registry seed semantics in both CLI help and README. The docs now explain implicit vs explicit --seed behavior and the interaction with --no-seed plus fetching_source_failed. Add scripts.seed_from_workspace as a first-class script and document its usage inline with generate_registry. The script emits sparse output for optional fields and avoids writing null source values.

Introduce scripts.generate_seed as the new seed extraction command. It accepts exactly one input source via a required mutually exclusive flag: --workspace [PATH] or --registry [PATH]. Update README examples to use generate_seed and document both supported input modes. Remove the old seed_from_workspace script.
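With argparse, a required mutually exclusive flag with an optional path looks roughly like this. The default paths used for const are assumptions, not necessarily what the script uses.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="generate_seed")
    # Exactly one of the two input modes must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    # nargs="?" makes PATH optional; const is used when the flag is bare.
    group.add_argument("--workspace", nargs="?", const="./workspace.json",
                       metavar="PATH")
    group.add_argument("--registry", nargs="?", const="./registry.json",
                       metavar="PATH")
    return parser
```

Passing neither flag, or both, makes argparse exit with a usage error.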

Add an incomplete-shape warning in generate_seed based on expected entry sizes (2 keys for active entries, 5 for tombstoned entries). The warning triggers when more than 10% of entries are incomplete. Special-case the all-incomplete scenario with a clearer message: "All packages have an incomplete shape".
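The threshold logic amounts to something like the sketch below. The helper name and the wording of the partial warning are hypothetical; the key counts, the 10% threshold and the all-incomplete message are the ones described above. Entries are assumed to be dicts, and "incomplete" is taken to mean fewer keys than expected.

```python
def incomplete_shape_warning(entries):
    """Return a warning string, or None if the seed looks complete enough."""
    if not entries:
        return None
    incomplete = sum(
        1 for entry in entries
        # Tombstoned entries are expected to have 5 keys, active ones 2.
        if len(entry) < (5 if "removed" in entry else 2)
    )
    if incomplete == len(entries):
        return "All packages have an incomplete shape"
    # Strictly more than 10%; integer math avoids float rounding.
    if incomplete * 10 > len(entries):
        return f"{incomplete} of {len(entries)} packages have an incomplete shape"
    return None
```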

Separate lifecycle seeding from source-failure recovery in generate_registry. Recovery of failed repositories now requires registry-shaped data and no longer reconstructs entries from workspace/seed maps. If an explicit seed is not registry-shaped, the command falls back to prior --output when available for recovery data. On fetch failures, emit a focused warning when the seed knows package names but no full recovery entries exist, with message text that reflects --no-seed behavior. Add regression tests for non-registry seed input and fallback-to-output recovery behavior.

Use has_registry_shape directly in resolve_failure_recovery_db without additionally checking available. The shape flag already encodes successful load plus registry-compatible structure. This keeps behavior unchanged while reducing redundant conditions.

Replace the SeedLoad available/null-object pattern with SeedDb | None. This removes ambiguous truthiness handling and makes seed presence explicit at call sites. Also rename read_seed_db(explicit=...) to strict=... to better reflect its behavior: strict mode raises on read/parse errors, while implicit mode returns None.
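For illustration, the strict/implicit split could look like this. The SeedDb alias and the path parameter are assumptions about the real signature, and Optional[SeedDb] is just the spelled-out form of SeedDb | None.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

SeedDb = Dict[str, Any]  # assumed alias for the example


def read_seed_db(path: Path, *, strict: bool) -> Optional[SeedDb]:
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # JSONDecodeError is a ValueError
        if strict:
            # Explicit --seed: the user asked for this file, so fail loudly.
            raise
        # Implicit seed: a missing or broken file just means "no seed".
        return None
```

Call sites then check "is None" instead of relying on the truthiness of a null object.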

Simplify lifecycle seed handling by building the name->entry map directly inside apply_seed_lifecycle and removing extract_seed_packages. Also generalize build_tombstone to accept Mapping[str, Any], keeping the key filtering in one place.

Replace generic Mapping typing for recovery_db with a dedicated RecoveryDb shape. This matches runtime guarantees and simplifies recovery iteration code. Use a TypeGuard for registry-shape detection and return RecoveryDb | None from resolve_failure_recovery_db.

Replace iter_db_entries(db, kind) with iter_package_entries(db), since this helper is only used for package traversal. This removes an unused selector parameter and makes call sites more explicit.

Avoid false-positive recovery warnings when a full registry-shaped seed is present but a failed repository has no known entries. Emit a clear generic warning only for compact seed.json inputs where failed-source recovery cannot be guaranteed, and point users to a full registry.json seed for complete recovery. Update and extend registry tests to cover compact-seed warning behavior and registry-seed non-warning behavior.

Update crawl workflow to seed generate_registry from ./.the-registry/registry.json and sync branch state via an extracted shell script. Add .github/workflows/sync_registry_branch.sh to perform compare-first syncing, fallback commit messaging, and push to the-registry. Add pytest coverage for happy path, no-op behavior, and classifier crash fallback using a local bare origin to avoid network pushes.

braver · 2026-04-01T19:36:01Z

> don't know what the actual policy was for pc.io.

Not sure what it looks like in the data, but on the site removals are simply not handled (i.e. nothing gets removed or even marked as such). Maybe Will sometimes did something manually, but not in recent years. I removed this one ages ago for instance: https://packagecontrol.io/packages/Theme%20-%20Sea%20Lion.

kaste · 2026-04-01T22:00:19Z

Oh, they just look like pages that don't get downloads. How would I scrape them? It is basically just a page with an outdated LAST SEEN tag, so you just go through all the pages and look for that. Certainly possible.

kaste · 2026-04-01T22:05:41Z

Actually, this would typically be installed on the package_control_channel repo, but we don't have enough permissions to run it over there and it'd just be a hassle.

braver · 2026-04-03T17:47:42Z

Maybe one day though! 🤞🏻

kaste added 30 commits March 31, 2026 12:21

Some type work

5792277

Internally rename variable db -> recovery_db

e6bdfcf

Inline seed package index for lifecycle merge

37f3104

Simplify package-entry iteration helper

4fbb643

Use Registry type in favor of RecoveryDb

8607efd

Pass seed down to apply_seed_lifecycle

e5f160c

Pass registry (db) down to apply_seed_lifecycle

e46166f

Remove two redundant str tests

aa05198

Pass seed down to fetch_packages

17dd245

WS

03cf536

kaste merged commit a825b0e into main Mar 31, 2026
3 checks passed

kaste deleted the peristent-registry branch March 31, 2026 10:30
